Matthew Talluto
22.02.2022
If \(x\) and \(y\) are independent, then knowing \(x\) tells us nothing about \(y\):

\[ pr(y|x) = pr(y) \]
\[ \mathrm{cov}_{xy} = \frac{\sum_{i=1}^n \left (x_i - \bar{x} \right) \left(y_i - \bar{y} \right)}{n-1} \]
\[ \begin{aligned} \mathrm{cov}_{xy} & = \frac{\sum_{i=1}^n \left (x_i - \bar{x} \right) \left(y_i - \bar{y} \right)}{n-1} \\ r_{xy} & = \frac{\mathrm{cov}_{xy}}{s_x s_y} \end{aligned} \]
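These definitions are easy to check directly in R; a minimal sketch using a small made-up data set (the values of x and y are arbitrary):

```r
# Covariance and Pearson correlation computed from the definitions,
# then compared against R's built-in cov() and cor()
x = c(1, 2, 3, 4, 5)
y = c(2, 1, 4, 3, 5)
n = length(x)

cov_xy = sum((x - mean(x)) * (y - mean(y))) / (n - 1)
r_xy = cov_xy / (sd(x) * sd(y))

cov_xy                        # 2
r_xy                          # 0.8
all.equal(cov_xy, cov(x, y))  # TRUE
all.equal(r_xy, cor(x, y))    # TRUE
```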
\(H_0\): \(\rho = 0\)
\(H_A\): two sided (\(\rho \ne 0\)) or one-sided (\(\rho > 0\) or \(\rho < 0\))
\(r\) has a standard error:
\[ s_{r} = \sqrt{\frac{1-r^2}{n-2}} \] We can then compute a \(t\)-statistic:
\[ t = \frac{r}{s_r} \]
The p-value is the probability, under \(H_0\), of a \(t\)-statistic at least as extreme as the one observed (computed from the CDF of the \(t\) distribution with \(n-2\) degrees of freedom); we reject \(H_0\) when \(p < \alpha\).
\[ \begin{aligned} s_{r} &= \sqrt{\frac{1-r^2}{n-2}} \\ t &= \frac{r}{s_r} \end{aligned} \]
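The same calculation can be done by hand in R and checked against cor.test (x and y here are made-up values):

```r
# t-test for a correlation coefficient, from the formulas above
x = c(1, 2, 3, 4, 5)
y = c(2, 1, 4, 3, 5)
n = length(x)

r = cor(x, y)
s_r = sqrt((1 - r^2) / (n - 2))
t_stat = r / s_r
p_val = 2 * pt(-abs(t_stat), df = n - 2)  # two-sided p-value

ct = cor.test(x, y)
all.equal(t_stat, unname(ct$statistic))  # TRUE
all.equal(p_val, ct$p.value)             # TRUE
```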
# Equivalent built-in test
# same test, with a formula instead
# cor.test(~ Sepal.Length + Petal.Length, data = iris_set, alternative = "two.sided")
with(iris_set,
cor.test(Sepal.Length, Petal.Length, alternative = "two.sided"))
##
## Pearson's product-moment correlation
##
## data: Sepal.Length and Petal.Length
## t = 2, df = 48, p-value = 0.06
## alternative hypothesis: true correlation is not equal to 0
## 95 percent confidence interval:
## -0.0121 0.5078
## sample estimates:
## cor
## 0.267

Correlation tests (cor.test and cor)

Test of proportions (prop.test) or \(\chi^2\) test (chisq.test)

Heterogeneity of subgroups
cor.test(x, y)
##
## Pearson's product-moment correlation
##
## data: x and y
## t = -1, df = 148, p-value = 0.3
## alternative hypothesis: true correlation is not equal to 0
## 95 percent confidence interval:
## -0.2447 0.0734
## sample estimates:
## cor
## -0.0879
cor.test(x, y, method = 'spearman')
##
## Spearman's rank correlation rho
##
## data: x and y
## S = 8e+05, p-value = 9e-06
## alternative hypothesis: true rho is not equal to 0
## sample estimates:
## rho
## -0.357

Observation:

| colour | F2 |
|---|---|
| violet | 705 |
| white | 224 |
\(H_0\): Inheritance is Mendelian (violet:white = 3:1)
\(H_A\): Inheritance is not exactly Mendelian
# Test of proportions against a null hypothesis
# For small sample sizes, use binom.test
counts = matrix(c(705, 224), ncol = 2)
prop.test(counts, p = 0.75, alternative = "two.sided")
##
## 1-sample proportions test with continuity correction
##
## data: counts, null probability 0.75
## X-squared = 0.3, df = 1, p-value = 0.6
## alternative hypothesis: true p is not equal to 0.75
## 95 percent confidence interval:
## 0.730 0.786
## sample estimates:
## p
## 0.759

woodpecker = read.csv("data/woodpecker.csv")
head(woodpecker)
## forest_type nest_tree
## 1 burned birch
## 2 burned birch
## 3 burned jack pine
## 4 burned aspen
## 5 burned birch
## 6 burned jack pine
table(woodpecker)
## nest_tree
## forest_type aspen birch jack pine
## burned 6 16 2
## unburned 29 18 23

We want to test for an association between the two variables (forest type and nest tree)
\(H_0\): Nest tree is not associated with forest type

\(H_A\): Nest tree is associated with forest type
table(woodpecker)
## nest_tree
## forest_type aspen birch jack pine
## burned 6 16 2
## unburned 29 18 23
table(woodpecker)/rowSums(table(woodpecker))
## nest_tree
## forest_type aspen birch jack pine
## burned 0.2500 0.6667 0.0833
## unburned 0.4143 0.2571 0.3286
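These hypotheses can be tested formally with chisq.test. A sketch entering the counts from the table above by hand, so it runs without the CSV file:

```r
# Chi-squared test of association between forest type and nest tree,
# using the counts shown in the table above
counts = matrix(c(6, 16, 2,
                  29, 18, 23),
                nrow = 2, byrow = TRUE,
                dimnames = list(forest_type = c("burned", "unburned"),
                                nest_tree = c("aspen", "birch", "jack pine")))
chisq.test(counts)
## X-squared ≈ 13.62, df = 2, p-value ≈ 0.0011
```

With \(p < 0.05\), we reject \(H_0\) and conclude nest tree is associated with forest type.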
Plots in ggplot have 3 required components: the data, an aesthetic mapping (the aes function), and at least one geometry.
## Sepal.Length Sepal.Width Petal.Length Petal.Width Species
## 1 5.1 3.5 1.4 0.2 setosa
## 2 4.9 3.0 1.4 0.2 setosa
## 3 4.7 3.2 1.3 0.2 setosa
## 4 4.6 3.1 1.5 0.2 setosa
## 5 5.0 3.6 1.4 0.2 setosa
## 6 5.4 3.9 1.7 0.4 setosa
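As a sketch, the three components map onto the code like this (using the full built-in iris data):

```r
library(ggplot2)

iris_plot = ggplot(iris,              # 1. the data
    aes(x = Sepal.Length,             # 2. the aesthetic mapping (aes)
        y = Petal.Length,
        colour = Species)) +
    geom_point()                      # 3. a geometry
iris_plot
```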
When the x variable is categorical or discrete, scatterplots become less useful.
birddiv = read.csv("data/birddiv.csv")
bird_plot = ggplot(birddiv, aes(x=forest_frag,
y = richness, colour = bird_type)) +
geom_point()
head(birddiv)
## Grow.degd For.cover NDVI bird_type richness forest_frag
## 1 330 99.9 60.38 forest 8 1
## 2 330 0.0 22.88 forest 1 0
## 3 330 38.3 11.86 forest 5 3
## 4 330 11.4 19.07 forest 7 7
## 5 330 0.0 2.12 forest 2 0
## 6 170 100.0 54.03 forest 7 1

Adding jitter can sometimes improve things
Better solution: boxplots
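A sketch of the boxplot approach; since the birddiv CSV is not included here, this uses simulated data with the same column names (the bird_type levels other than "forest" and all counts are invented):

```r
library(ggplot2)

# Simulated stand-in for birddiv: a categorical grouping and a count response
set.seed(42)
toy = data.frame(
    bird_type = rep(c("forest", "open", "scrub"), each = 30),
    richness  = rpois(90, lambda = rep(c(7, 4, 5), each = 30))
)

box_plot = ggplot(toy, aes(x = bird_type, y = richness)) +
    geom_boxplot(outlier.shape = NA) +       # boxes summarise each group
    geom_jitter(width = 0.15, alpha = 0.5)   # raw points, jittered
box_plot
```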
You see this a lot. When should you do it? NEVER
This is almost as bad, and sadly much more common
Barplots, or proportional bars for counts within categories
table(woodpecker)
## nest_tree
## forest_type aspen birch jack pine
## burned 6 16 2
## unburned 29 18 23
woodp_plot = ggplot(woodpecker, aes(x = nest_tree,
fill = forest_type))
woodp_plot = woodp_plot + geom_bar(width = 0.5,
position=position_dodge())
woodp_plot = woodp_plot + xlab("Nest Tree Type") +
theme_minimal() + labs(fill = "Forest Type")
woodp_plot

Correlation assumes nothing about the causal nature of the relationship between x and y
Often, we have a hypothesis that a variable is in some way functionally dependent on another
A simple linear regression describes/tests the relationship between a predictor variable \(x\) and a response variable \(y\):
\[ \begin{aligned} \mathbb{E}(y|x) & = \hat{y} = \alpha + \beta x \\ & \approx a + bx \end{aligned} \]
\(y\) is not perfectly predicted by \(x\), so we must include an error (residual) term:
\[ \begin{aligned} y_i & = \hat{y_i} + \epsilon_i \\ & = a + bx_i + \epsilon_i\\ \\ \epsilon & \sim \mathcal{N}(0, s_{\epsilon}) \end{aligned} \]
\[ \begin{aligned} \hat{y_i} & = a + b x_i \\ y_i & = \hat{y_i} + \epsilon_i \\ \epsilon & \sim \mathcal{N}(0, s_{\epsilon}) \end{aligned} \]
We want to solve these equations for \(a\) and \(b\)
The “best” \(a\) and \(b\) are the ones that draw a line that is closest to the most points (i.e., that minimizes \(s_{\epsilon}\))
\[ s_{\epsilon} = \sqrt{\frac{\sum_{i=1}^n \left (y_i -\hat{y_i}\right )^2}{n-2}} \]
\[ \begin{aligned} \mathrm{ESS} & = \sum_{i=1}^n \left (y_i -\hat{y_i}\right )^2 \\ & = \sum_{i=1}^n \left (y_i - a - bx_i \right )^2 \end{aligned} \]
Solving for the minimum ESS yields:
\[ \begin{aligned} b & = \frac{\mathrm{cov_{xy}}}{s^2_x} \\ \\ & = r_{xy}\frac{s_y}{s_x}\\ \\ a & = \bar{y} - b\bar{x} \end{aligned} \]
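These solutions are easy to verify in R; a minimal sketch with made-up data:

```r
# Least-squares slope and intercept from the formulas above, vs lm()
x = c(1, 2, 3, 4, 5, 6)
y = c(2.1, 3.9, 6.2, 8.1, 9.8, 12.2)

b = cov(x, y) / var(x)           # slope: cov_xy / s^2_x
b2 = cor(x, y) * sd(y) / sd(x)   # equivalent form: r_xy * s_y / s_x
a = mean(y) - b * mean(x)        # intercept

all.equal(b, b2)                             # TRUE
all.equal(unname(coef(lm(y ~ x))), c(a, b))  # TRUE
```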
The parameters have standard errors:
\[ s_a = s_{\hat{y}}\sqrt{\frac{1}{n} + \frac{\bar{x}^2}{\sum(x-\bar{x})^2}} \]
\[ s_b = \frac{s_{\hat{y}}}{\sqrt{\sum(x-\bar{x})^2}} \]
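A sketch verifying these standard errors against summary(lm()), again with made-up data:

```r
# Standard errors of intercept and slope from the formulas above
x = c(1, 2, 3, 4, 5, 6)
y = c(2.1, 3.9, 6.2, 8.1, 9.8, 12.2)
n = length(x)

mod = lm(y ~ x)
s_yhat = sqrt(sum(residuals(mod)^2) / (n - 2))  # residual standard error

s_a = s_yhat * sqrt(1/n + mean(x)^2 / sum((x - mean(x))^2))
s_b = s_yhat / sqrt(sum((x - mean(x))^2))

# should match the "Std. Error" column for (Intercept) and x
all.equal(unname(summary(mod)$coefficients[, "Std. Error"]), c(s_a, s_b))  # TRUE
```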
\(\mathbf{H_0}\): The slope \(\beta\) = 0 (i.e., no variation in \(y\) is explained by variation in \(x\))
The ratio of variance explained to residual variance follows an \(F\) distribution
\[\begin{aligned} F & = \frac{MS_{model}}{MS_{err}} \\ MS_{model} & = \sum_{i=1}^n \left ( \hat{y}_i - \bar{y}\right)^2 \\ MS_{err} & = \frac{\sum_{i=1}^n \left ( y_i - \hat{y}_i \right)^2}{n-2} \end{aligned}\]

The coefficient of determination tells you the proportion of variance explained by the model:
\[ r^2 = \frac{\sum_{i=1}^n \left ( \hat{y_i} - \bar{y} \right )^2} {\sum_{i=1}^n \left ( y_i - \bar{y} \right )^2} \]
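This can be verified against summary(lm()); for a simple regression it also equals the squared Pearson correlation (toy data, as above):

```r
# Coefficient of determination from the formula above
x = c(1, 2, 3, 4, 5, 6)
y = c(2.1, 3.9, 6.2, 8.1, 9.8, 12.2)

mod = lm(y ~ x)
y_hat = fitted(mod)
r2 = sum((y_hat - mean(y))^2) / sum((y - mean(y))^2)

all.equal(r2, summary(mod)$r.squared)  # TRUE
all.equal(r2, cor(x, y)^2)             # TRUE for simple linear regression
```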
## Data on sparrow wing lengths from Zar (1984)
dat = data.frame(age = c(3:6, 8:12, 14:17),
wing_length = c(1.4, 1.5, 2.2, 2.4, 3.1, 3.2, 3.2, 3.9, 4.1, 4.7, 4.5, 5.2, 5.0))
mod = lm(wing_length ~ age, data = dat)
summary(mod)
##
## Call:
## lm(formula = wing_length ~ age, data = dat)
##
## Residuals:
## Min 1Q Median 3Q Max
## -0.3070 -0.2154 0.0655 0.1632 0.2251
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 0.7131 0.1479 4.82 0.00053 ***
## age 0.2702 0.0135 20.03 5.3e-10 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 0.218 on 11 degrees of freedom
## Multiple R-squared: 0.973, Adjusted R-squared: 0.971
## F-statistic: 401 on 1 and 11 DF, p-value: 5.27e-10
confint(mod)
## 2.5 % 97.5 %
## (Intercept) 0.388 1.04
## age 0.241 0.30

par(mfrow=c(1, 2), bty='n')
## scale(residuals) produces **standardized** residuals
qqnorm(scale(residuals(mod)), pch=16)
qqline(scale(residuals(mod)), lty=2)
plot(dat$age, residuals(mod), xlab = "age", ylab = "residuals", pch=16)
abline(h = 0, lty=2)

We found a significant positive relationship between wing length and age (\(F_{1,11} = 401\), \(p < 0.001\), \(R^2 = 0.97\); Table 1)
Table 1. Parameter estimates for the linear regression of wing length on age.

| parameter | estimate | st. error | 95% CI |
|---|---|---|---|
| intercept | 0.71 | 0.150 | 0.39, 1.03 |
| age | 0.27 | 0.013 | 0.24, 0.30 |